Abstract

It is now well established that sparse signal models are well suited to restoration
tasks and can effectively be learned from audio, image, and video data. Recent research
has been aimed at learning discriminative sparse models instead of purely
reconstructive ones. This paper proposes a new step in that direction, with a novel
sparse representation for signals belonging to different classes in terms of a shared
dictionary and multiple discriminative class models. The linear variant of the proposed
model admits a simple probabilistic interpretation, while its most general
variant admits an interpretation in terms of kernels. An optimization framework
for learning all the components of the proposed model is presented, along with
experimental results on standard handwritten digit and texture classification tasks.


1  Introduction 

Sparse and overcomplete image models were first introduced in [1] for modeling the spatial
receptive fields of simple cells in the human visual system. The linear decomposition of a signal
using a few atoms of a learned dictionary, instead of predefined ones such as wavelets, has recently
led to state-of-the-art results for numerous low-level image processing tasks such as denoising [2],
showing that sparse models are well adapted to natural images. Unlike principal component analysis
decompositions, these models are most often overcomplete, with a number of basis elements greater
than the dimension of the data. Recent research has shown that sparsity helps to capture
higher-order correlation in data: In [3, 4], sparse decompositions are used with predefined
dictionaries for face and signal recognition. In [5], dictionaries are learned for a reconstruction
task, and the sparse decompositions are then used a posteriori within a classifier. In [6], a
discriminative method is introduced for various classification tasks, learning one dictionary per
class; the classification process itself is based on the corresponding reconstruction error, and
does not exploit the actual decomposition coefficients. In [7], a generative model for document
representation is learned at the same time as the parameters of a deep network structure. The
framework we present in this paper extends these approaches by learning simultaneously a single
shared dictionary as well as multiple models for different signal classes in a mixed generative and
discriminative formulation (see also [8], where a different discrimination term is added to the
classical reconstructive one for supervised dictionary learning). Similar joint
generative/discriminative frameworks have started to appear in probabilistic approaches to
learning, e.g., [9, 10, 11, 12, 13, 14], but not, to the best of our knowledge, in the sparse
dictionary learning framework. Section 2 presents the formulation and Section 3 its interpretation
in terms of probability and kernel frameworks. The optimization procedure is detailed in
Section 4, and experimental results are presented in Section 5.

2  Supervised  dictionary  learning 

We present in this section the core of the proposed model. We start by describing how to perform
sparse coding in a supervised fashion, then show how to simultaneously learn a
discriminative/reconstructive dictionary and a classifier.
2.1  Supervised  Sparse  Coding

In classical sparse coding tasks, one considers a signal $x$ in $\mathbb{R}^n$ and a fixed dictionary
$D = [d_1, \dots, d_k]$ in $\mathbb{R}^{n \times k}$ (allowing $k > n$, making the dictionary
overcomplete). In this setting, sparse coding with an $\ell_1$ regularization¹ amounts to computing

$$\mathcal{R}^\star(x, D) = \min_{\alpha \in \mathbb{R}^k} \|x - D\alpha\|_2^2 + \lambda_1 \|\alpha\|_1. \tag{1}$$

It is well known in the statistics, optimization, and compressed sensing communities that the
$\ell_1$ penalty yields a sparse solution, that is, very few nonzero coefficients in $\alpha$ [15],
although there is no explicit analytic link between the value of $\lambda_1$ and the effective
sparsity that this model yields. Other sparsity penalties using the $\ell_0$ regularization² can be
used as well. Since it uses a proper norm, the $\ell_1$ formulation of sparse coding is a convex
problem, which makes the optimization tractable with algorithms such as those introduced in
[16, 17], and has proven in our proposed framework to be more stable than its $\ell_0$ counterpart,
in the sense that the resulting decompositions are less sensitive to small perturbations of the
input signal $x$. Note that sparse coding with an $\ell_0$ penalty is an NP-hard problem and is
often approximated using greedy algorithms.
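
As an illustration of the $\ell_1$ problem of Eq. (1), the sketch below minimizes it with plain iterative soft-thresholding (ISTA). ISTA is chosen here only for brevity and is not one of the solvers of [16, 17]; the helper names, the toy dictionary, and the value of `lam1` are assumptions made for the example.

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1 (elementwise shrinkage)."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def objective(x, D, alpha, lam1):
    """Value of ||x - D alpha||_2^2 + lam1 * ||alpha||_1, as in Eq. (1)."""
    r = x - D @ alpha
    return float(r @ r + lam1 * np.abs(alpha).sum())

def ista(x, D, lam1, n_iter=500):
    """Approximately minimize Eq. (1) over alpha by proximal gradient descent."""
    # Lipschitz constant of the gradient of ||x - D alpha||_2^2.
    L = 2.0 * np.linalg.norm(D, 2) ** 2
    alpha = np.zeros(D.shape[1])
    for _ in range(n_iter):
        grad = -2.0 * D.T @ (x - D @ alpha)
        alpha = soft_threshold(alpha - grad / L, lam1 / L)
    return alpha

# Hypothetical toy problem: overcomplete dictionary (k > n) with unit-norm atoms.
rng = np.random.default_rng(0)
n, k = 20, 50
D = rng.standard_normal((n, k))
D /= np.linalg.norm(D, axis=0)
x = D[:, [3, 17]] @ np.array([1.0, -0.5])   # signal built from two atoms
alpha = ista(x, D, lam1=0.1)
```

With a step of $1/L$ the objective cannot increase from one iteration to the next, which makes this a convenient baseline even though faster methods exist.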

In this paper, we consider a different setting, where the signal may belong to any of $p$ different
classes. We model the signal $x$ using a single shared dictionary $D$ and a set of $p$ decision
functions $g_i(x, \alpha, \theta)$ ($i = 1, \dots, p$) acting on $x$ and its sparse code $\alpha$
over $D$. The function $g_i$ should be positive for any signal in class $i$ and negative otherwise.
The vector $\theta$ parametrizes the model and will be jointly learned with $D$. In the following,
we will consider two kinds of models:

(i) linear in $\alpha$: $g_i(x, \alpha, \theta) = w_i^T \alpha + b_i$, where
$\theta = \{w_i \in \mathbb{R}^k, b_i \in \mathbb{R}\}_{i=1}^p$, and the vectors $w_i$
($i = 1, \dots, p$) can be thought of as $p$ linear models for the coefficients $\alpha$, with the
scalars $b_i$ acting as biases;

(ii) bilinear in $x$ and $\alpha$: $g_i(x, \alpha, \theta) = x^T W_i \alpha + b_i$, where
$\theta = \{W_i \in \mathbb{R}^{n \times k}, b_i \in \mathbb{R}\}_{i=1}^p$. Note that the number of
parameters in (ii) is greater than in (i), which allows for richer models. One can interpret $W_i$
as a filter encoding the input signal $x$ into a model for the coefficients $\alpha$, which has a
role similar to the encoder in [18] but for a discriminative task.
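
The two families of decision functions can be written in a few lines. The following is a minimal sketch with randomly drawn, purely illustrative parameters; the array layouts (`W` of shape k x p, `Ws` of shape p x n x k) are assumptions made for the example.

```python
import numpy as np

def g_linear(alpha, W, b):
    """Model (i): the p scores w_i^T alpha + b_i (W has shape k x p)."""
    return W.T @ alpha + b

def g_bilinear(x, alpha, Ws, b):
    """Model (ii): the p scores x^T W_i alpha + b_i (Ws has shape p x n x k)."""
    return np.einsum("n,pnk,k->p", x, Ws, alpha) + b

rng = np.random.default_rng(0)
n, k, p = 6, 10, 3
x, alpha = rng.standard_normal(n), rng.standard_normal(k)
W = rng.standard_normal((k, p))
Ws = rng.standard_normal((p, n, k))
b = np.zeros(p)

scores_lin = g_linear(alpha, W, b)
scores_bil = g_bilinear(x, alpha, Ws, b)
# Parameter counts without biases: p*k for (i) versus p*n*k for (ii).
n_params_lin, n_params_bil = W.size, Ws.size
```

The bilinear model multiplies the number of weights by the signal dimension $n$, which is the source of both its extra expressiveness and its higher risk of overfitting.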

Let us define softmax discriminative cost functions as

$$C_i(x_1, \dots, x_p) = \log\Big(\sum_{j=1}^{p} e^{x_j - x_i}\Big)$$

for $i = 1, \dots, p$. These are multiclass versions of the logistic function, enjoying properties
similar to that of the hinge loss from the SVM literature, while being differentiable. Given some
input signal $x$ and fixed (for now) dictionary $D$ and parameters $\theta$, the supervised sparse
coding problem for the class $i$ can be defined as computing

$$S_i^\star(x, D, \theta) = \min_{\alpha} S_i(\alpha, x, D, \theta), \tag{2}$$

where

$$S_i(\alpha, x, D, \theta) = C_i(\{g_j(x, \alpha, \theta)\}_{j=1}^p) + \lambda_0 \|x - D\alpha\|_2^2 + \lambda_1 \|\alpha\|_1. \tag{3}$$

Note  the  explicit  incorporation  of  the  classification  and  discriminative  component  into  sparse  coding, 
in  addition  to  the  classical  reconstructive  term  (see  [8]  for  a  different  classification  component).  In 
turn,  any  solution  to  this  problem  provides  a  straightforward  classification  procedure,  namely: 

$$i^\star(x, D, \theta) = \arg\min_{i = 1, \dots, p} S_i^\star(x, D, \theta). \tag{4}$$
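
The cost of Eq. (3) can be evaluated as in the sketch below, which assumes the linear model and a softmax cost of the form $C_i(z) = \log\sum_j e^{z_j - z_i}$. For simplicity it scores all classes at one common code `alpha`, whereas Eq. (4) minimizes over $\alpha$ separately for each class; all names and toy values are hypothetical.

```python
import numpy as np

def softmax_cost(z, i):
    """C_i(z_1..z_p) = log(sum_j exp(z_j - z_i)), computed stably."""
    m = z.max()
    return float(m + np.log(np.exp(z - m).sum()) - z[i])

def supervised_cost(i, alpha, x, D, W, b, lam0, lam1):
    """S_i of Eq. (3) for the linear model g_j = w_j^T alpha + b_j."""
    g = W.T @ alpha + b
    r = x - D @ alpha
    return softmax_cost(g, i) + lam0 * float(r @ r) + lam1 * float(np.abs(alpha).sum())

rng = np.random.default_rng(0)
n, k, p = 8, 12, 3
D = rng.standard_normal((n, k))
W, b = rng.standard_normal((k, p)), np.zeros(p)
x, alpha = rng.standard_normal(n), rng.standard_normal(k)

costs = np.array([supervised_cost(i, alpha, x, D, W, b, 1.0, 0.1) for i in range(p)])
# At a common code, the reconstruction and l1 terms are shared by all classes,
# so the smallest cost belongs to the class with the largest score g_i.
predicted = int(np.argmin(costs))
```

Since $e^{-C_i}$ is the usual softmax probability of class $i$, the cost is small exactly when the score of class $i$ dominates the others.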

Compared with earlier work using one dictionary per class [6], this model has the advantage of
letting multiple classes share some features, and uses the coefficients $\alpha$ of the sparse
representations as part of the classification procedure, thereby following the works from
[3, 4, 5], but with learned representations optimized for the classification task similar to
[8, 9]. As shown in Section 3, this formulation has a straightforward probabilistic interpretation,
but let us first see how to learn the dictionary $D$ and the parameters $\theta$ from training data.

2.2  SDL:  Supervised  Dictionary  Learning 

Let us assume that we are given $p$ sets of training data $T_i$, $i = 1, \dots, p$, such that all
samples in $T_i$ belong to class $i$. The most direct method for learning $D$ and $\theta$ is to
minimize with respect to these
¹The $\ell_1$ norm of a vector $x$ of size $n$ is defined as $\|x\|_1 = \sum_{i=1}^{n} |x[i]|$.

²The $\ell_0$ pseudo-norm of a vector $x$ is the number of nonzero coefficients of $x$. Note that it is not a norm.
Figure 1: Graphical model for the proposed generative/discriminative learning framework.


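variables the mean value of $S_i^\star$, with an $\ell_2$ regularization term to prevent overfitting:

$$\min_{D, \theta} \Big(\sum_{i=1}^{p} \sum_{j \in T_i} S_i^\star(x_j, D, \theta)\Big) + \lambda_2 \|\theta\|_2^2, \quad \text{s.t.}\ \forall i = 1, \dots, k,\ \|d_i\|_2 \le 1. \tag{5}$$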

Since the reconstruction errors $\|x - D\alpha\|_2^2$ are invariant to scaling simultaneously $D$
by a scalar and $\alpha$ by its inverse, constraining the $\ell_2$ norm of columns of $D$ prevents
any transfer of energy between these two variables, which would have the effect of overcoming the
sparsity penalty. Such a constraint is classical in sparse coding [2]. We will refer later to this
model as SDL-G (supervised dictionary learning, generative).
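
The scaling argument above can be checked numerically. The sketch below uses arbitrary random data and a hypothetical helper `project_columns`; it shows that rescaling $D$ and $\alpha$ leaves the residual unchanged while shrinking the $\ell_1$ penalty, and that a column-wise projection restores the constraint $\|d_i\|_2 \le 1$.

```python
import numpy as np

def project_columns(D):
    """Rescale every column whose l2 norm exceeds 1 back onto the unit ball."""
    norms = np.linalg.norm(D, axis=0)
    return D / np.maximum(norms, 1.0)

rng = np.random.default_rng(0)
D = rng.standard_normal((8, 12))
alpha = rng.standard_normal(12)
x = rng.standard_normal(8)

c = 10.0
res_before = np.linalg.norm(x - D @ alpha)
res_after = np.linalg.norm(x - (c * D) @ (alpha / c))   # identical residual
l1_before = np.abs(alpha).sum()
l1_after = np.abs(alpha / c).sum()                       # penalty divided by c

D_proj = project_columns(c * D)                          # all columns back in the unit ball
```

Without the norm constraint, letting `c` grow would drive the sparsity penalty to zero at no reconstruction cost, which is the degenerate behavior the constraint rules out.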

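Nevertheless, since the classification procedure from Eq. (4) will compare the different residuals
$S_i^\star$ of a given signal for $i = 1, \dots, p$, a more discriminative approach is to not only
make the $S_i^\star$ small for signals with label $i$, as in (5), but also make the value of
$S_j^\star$ greater than $S_i^\star$ for $j$ different from $i$, which is the purpose of the softmax
function $C_i$. This leads to:

$$\min_{D, \theta} \Big(\sum_{i=1}^{p} \sum_{j \in T_i} C_i(\{S_l^\star(x_j, D, \theta)\}_{l=1}^p)\Big) + \lambda_2 \|\theta\|_2^2, \quad \text{s.t.}\ \forall i = 1, \dots, k,\ \|d_i\|_2 \le 1. \tag{6}$$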

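As detailed below, this problem is more difficult to solve than Eq. (5), and therefore we adopt
instead a mixed formulation between the minimization of the generative Eq. (5) and its
discriminative version (6) [12], that is,

$$\min_{D, \theta} \Big(\sum_{i=1}^{p} \sum_{j \in T_i} \mu\, C_i(\{S_l^\star(x_j, D, \theta)\}_{l=1}^p) + (1 - \mu)\, S_i^\star(x_j, D, \theta)\Big) + \lambda_2 \|\theta\|_2^2 \quad \text{s.t.}\ \forall i,\ \|d_i\|_2 \le 1, \tag{7}$$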

where $\mu$ controls the trade-off between reconstruction from Eq. (5) and discrimination from
Eq. (6). This is the proposed generative/discriminative model for sparse signal representation and
classification from learned dictionary $D$ and model $\theta$. We will refer to this mixed model as
SDL-D (supervised dictionary learning, discriminative). Before presenting the proposed
optimization procedure, we provide below two interpretations of the linear and bilinear versions of
our formulation in terms of a probabilistic graphical model and a kernel.

3  Interpreting  the  model 

3.1  A  probabilistic  interpretation  of  the  linear  model 

Let us first construct a graphical model which gives a probabilistic interpretation to the training
and classification criteria given above when using a linear model with zero bias (no constant term)
on the coefficients, that is, $g_i(x, \alpha, \theta) = w_i^T \alpha$. This model consists of the
following components (Figure 1):

• The matrices $D$ and $W$ are parameters of the problem, with a Gaussian prior on $W$,
$p(W) \propto e^{-\lambda_2 \|W\|_2^2}$, and on the columns of $D$,
$p(D) \propto \prod_{l=1}^{k} e^{-\gamma_l \|d_l\|_2^2}$, where the $\gamma_l$'s are the Gaussian
parameters. All the $d_l$'s are considered independent of each other.

• The coefficients $\alpha_j$ are latent variables with a Laplace prior,
$p(\alpha_j) \propto e^{-\lambda_1 \|\alpha_j\|_1}$.

• The signals $x_j$ are generated according to a Gaussian probability distribution conditioned on
$D$ and $\alpha_j$, $p(x_j | \alpha_j, D) \propto e^{-\lambda_0 \|x_j - D\alpha_j\|_2^2}$. All the
$x_j$'s are considered independent from each other.

• The labels $y_j$ are generated according to a probability distribution conditioned on $W$ and
$\alpha_j$, and given by
$p(y_j = i | \alpha_j, W) = e^{w_i^T \alpha_j} / \sum_{l=1}^{p} e^{w_l^T \alpha_j}$. Given $D$ and
$W$, all the triplets $(\alpha_j, x_j, y_j)$ are independent.
What is commonly called “generative training” in the literature (e.g., [11, 12]) amounts to finding
the maximum likelihood for $D$ and $W$ according to the joint distribution
$p(\{x_j, y_j\}_{j=1}^{m}, D, W)$, where the $x_j$'s and the $y_j$'s are respectively the $m$
training signals and their labels. It can easily be shown (details omitted due to space
limitations) that there is an equivalence between this generative training and our formulation in
Eq. (5) under MAP approximations.³ Although joint generative modeling of $x$ and $y$ through a
shared representation, e.g., [9], has shown great promise, we show in this paper that a more
discriminative approach is desirable. “Discriminative training” is slightly different and amounts
to maximizing $p(\{y_j\}_{j=1}^{m}, D, W | \{x_j\}_{j=1}^{m})$ with respect to $D$ and $W$: Given
some input data, one finds the best parameters that will predict the labels of the data. The same
kind of MAP approximation relates this discriminative training formulation to the discriminative
model of Eq. (6) (again, details omitted due to space limitations). The mixed approach from Eq. (7)
is a classical trade-off between generative and discriminative (e.g., [11, 12]), where generative
components are often added to discriminative frameworks to add robustness, e.g., to noise and
occlusions (see examples of this for the model in [8]).
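
To make the MAP connection concrete, the following worked step writes out the negative log of the joint density for a single training pair. It is only an illustration, not the omitted derivation, and it assumes the specific densities $p(x_j | \alpha_j, D) \propto e^{-\lambda_0 \|x_j - D\alpha_j\|_2^2}$, $p(\alpha_j) \propto e^{-\lambda_1 \|\alpha_j\|_1}$, and a softmax label model $p(y_j = i | \alpha_j, W) = e^{w_i^T \alpha_j} / \sum_l e^{w_l^T \alpha_j}$, with normalization constants dropped.

```latex
\begin{aligned}
-\log p(x_j, y_j = i, \alpha_j \mid D, W)
 &= -\log p(y_j = i \mid \alpha_j, W) - \log p(x_j \mid \alpha_j, D) - \log p(\alpha_j) \\
 &= \log\Big(\sum_{l=1}^{p} e^{w_l^T \alpha_j - w_i^T \alpha_j}\Big)
    + \lambda_0 \|x_j - D\alpha_j\|_2^2 + \lambda_1 \|\alpha_j\|_1 + \text{const} \\
 &= S_i(\alpha_j, x_j, D, \theta) + \text{const}.
\end{aligned}
```

Under these assumptions, replacing the marginalization over $\alpha_j$ by a minimization gives $S_i^\star(x_j, D, \theta)$, and summing over the training pairs together with $-\log p(W) = \lambda_2 \|W\|_2^2 + \text{const}$ gives an objective of the same form as Eq. (5). The prior on $D$ is left aside in this sketch.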

3.2  A  kernel  interpretation  of  the  bilinear  model 

Our bilinear model with $g_i(x, \alpha, \theta) = x^T W_i \alpha + b_i$ does not admit a
straightforward probabilistic interpretation. On the other hand, it can easily be interpreted in
terms of kernels: Given two signals $x_1$ and $x_2$, with coefficients $\alpha_1$ and $\alpha_2$,
using the kernel $K(x_1, x_2) = \alpha_1^T \alpha_2\, x_1^T x_2$ in a logistic regression classifier
amounts to finding a decision function of the same form as (ii). It is a product of two linear
kernels, one on the $\alpha$'s and one on the input signals $x$. Interestingly, Raina et al. [5]
learn a dictionary adapted to reconstruction on a training set, then train an SVM a posteriori on
the decomposition coefficients $\alpha$. They derive and use a Fisher kernel, which can be written
as $K'(x_1, x_2) = \alpha_1^T \alpha_2\, r_1^T r_2$ in this setting, where the $r$'s are the
residuals of the decompositions. Experimentally, we have observed that the kernel $K$, where the
signals $x$ replace the residuals $r$, generally yields a level of performance similar to $K'$, and
often actually does better when the number of training samples is small or the data are noisy.
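
The link between the kernel $K$ and the bilinear decision function can be verified numerically. The sketch below, with random illustrative vectors and a hypothetical `feature` helper, checks that $K$ is the inner product of the flattened outer products $x\alpha^T$, and that a linear function of this feature map is exactly $x^T W \alpha$.

```python
import numpy as np

def kernel(x1, a1, x2, a2):
    """K = (alpha_1^T alpha_2) * (x_1^T x_2): a product of two linear kernels."""
    return float(a1 @ a2) * float(x1 @ x2)

def feature(x, a):
    """Explicit feature map: the flattened outer product x alpha^T."""
    return np.outer(x, a).ravel()

rng = np.random.default_rng(0)
n, k = 5, 7
x1, x2 = rng.standard_normal(n), rng.standard_normal(n)
a1, a2 = rng.standard_normal(k), rng.standard_normal(k)
W = rng.standard_normal((n, k))

k_direct = kernel(x1, a1, x2, a2)
k_feature = float(feature(x1, a1) @ feature(x2, a2))
# A linear function of the feature map is the bilinear score x^T W alpha.
score_bilinear = float(x1 @ W @ a1)
score_feature = float(W.ravel() @ feature(x1, a1))
```

This is why a linear classifier trained with $K$ has a decision function of the bilinear form (ii), up to the bias.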


4  Optimization  procedure 


Classical dictionary learning techniques (e.g., [1, 5, 19]) address the problem of learning a
reconstructive dictionary $D$ in $\mathbb{R}^{n \times k}$ well adapted to a training set $T$ as

$$\min_{D, \alpha} \sum_{j \in T} \|x_j - D\alpha_j\|_2^2 + \lambda_1 \|\alpha_j\|_1, \tag{8}$$

which is not jointly convex in $(D, \alpha)$, but convex with respect to each unknown when the other
one is fixed. This is why block coordinate descent on $D$ and $\alpha$ performs reasonably well
[1, 5, 19], although not necessarily providing the global optimum. Training when $\mu = 0$
(generative case), i.e., from Eq. (5), enjoys similar properties and can be addressed with the same
optimization procedure. Equation (5) can be rewritten as:

$$\min_{D, \theta, \alpha} \Big(\sum_{i=1}^{p} \sum_{j \in T_i} S_i(\alpha_j, x_j, D, \theta)\Big) + \lambda_2 \|\theta\|_2^2, \quad \text{s.t.}\ \forall i = 1, \dots, k,\ \|d_i\|_2 \le 1. \tag{9}$$


Block coordinate descent consists therefore of iterating between supervised sparse coding, where
$D$ and $\theta$ are fixed and one optimizes with respect to the $\alpha$'s, and supervised
dictionary update, where the coefficients $\alpha_j$'s are fixed, but $D$ and $\theta$ are updated.
Details on how to solve these two problems are given in Sections 4.1 and 4.2.
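
The alternation can be sketched on the purely reconstructive problem of Eq. (8). The code below is an illustration under simplifying assumptions, not the procedure used in the experiments: both blocks are handled with fixed-step (projected or proximal) gradient iterations, the function names are hypothetical, and the data are random.

```python
import numpy as np

def soft_threshold(V, t):
    return np.sign(V) * np.maximum(np.abs(V) - t, 0.0)

def objective(X, D, A, lam1):
    """Sum over signals of ||x_j - D alpha_j||_2^2 + lam1 * ||alpha_j||_1."""
    return float(np.sum((X - D @ A) ** 2) + lam1 * np.abs(A).sum())

def sparse_coding_step(X, D, A, lam1, n_iter=50):
    """D fixed: proximal gradient (ISTA) iterations on all codes at once."""
    L = 2.0 * np.linalg.norm(D, 2) ** 2 + 1e-12
    for _ in range(n_iter):
        G = -2.0 * D.T @ (X - D @ A)
        A = soft_threshold(A - G / L, lam1 / L)
    return A

def dictionary_step(X, D, A, n_iter=50):
    """Codes fixed: projected gradient on D under ||d_l||_2 <= 1."""
    L = 2.0 * np.linalg.norm(A, 2) ** 2 + 1e-12
    for _ in range(n_iter):
        G = -2.0 * (X - D @ A) @ A.T
        D = D - G / L
        D = D / np.maximum(np.linalg.norm(D, axis=0), 1.0)  # projection
    return D

def learn_dictionary(X, k, lam1, n_outer=20, seed=0):
    rng = np.random.default_rng(seed)
    D = rng.standard_normal((X.shape[0], k))
    D /= np.linalg.norm(D, axis=0)
    A = np.zeros((k, X.shape[1]))
    history = [objective(X, D, A, lam1)]
    for _ in range(n_outer):
        A = sparse_coding_step(X, D, A, lam1)   # block 1: codes
        D = dictionary_step(X, D, A)            # block 2: dictionary
        history.append(objective(X, D, A, lam1))
    return D, A, history

rng = np.random.default_rng(1)
X = rng.standard_normal((16, 200))              # 200 toy signals of dimension 16
D, A, history = learn_dictionary(X, k=32, lam1=0.2)
```

Each block is convex and each inner step uses a step size of one over the Lipschitz constant, so the objective recorded in `history` does not increase, even though the joint problem is not convex.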


The discriminative version of SDL from Eq. (6) is more problematic. The minimization of the
term $C_i(\{S_l(\alpha_{jl}, x_j, D, \theta)\}_{l=1}^p)$ with respect to $D$ and $\theta$ when the
$\alpha_{jl}$'s are fixed is not convex in general, and does not necessarily decrease the first
term of Eq. (6), i.e., $C_i(\{S_l^\star(x_j, D, \theta)\}_{l=1}^p)$. To reach a local minimum for
this difficult problem, we have chosen a continuation method, starting from the generative case and
ending with the discriminative one as in [6]. The algorithm is presented in Figure 2, and details
on the hyperparameter settings are given in Section 5.


4.1  Supervised  sparse  coding 

The supervised sparse coding problem from Eq. (10) ($D$ and $\theta$ are fixed in this step) amounts
to minimizing a convex function under an $\ell_1$ penalty. The fixed-point continuation method (FPC) from

³We are also investigating how to properly estimate $D$ by marginalizing over $\alpha$ instead of
maximizing with respect to that parameter.
Input: $p$ (number of classes); $n$ (signal dimensions); $\{T_i\}_{i=1}^{p}$ (training signals); $k$ (size of the
dictionary); $\lambda_0, \lambda_1, \lambda_2$ (parameters); $0 \le \mu_1 \le \mu_2 \le \dots \le \mu_m \le 1$ (increasing sequence).
Output: $D \in \mathbb{R}^{n \times k}$ (dictionary); $\theta$ (parameters).

Initialization: Set $D$ to a random Gaussian matrix. Set $\theta$ to zero.

Loop: For $\mu = \mu_1, \dots, \mu_m$,

  Loop: Repeat until convergence (or a fixed number of iterations),

    • Supervised sparse coding: Compute, for all $i = 1, \dots, p$, all $j$ in $T_i$, and all $l = 1, \dots, p$,

      $$\alpha^\star_{jl} = \arg\min_{\alpha \in \mathbb{R}^k} S_l(\alpha, x_j, D, \theta). \tag{10}$$

    • Dictionary update: Solve, under the constraint $\|d_l\|_2 \le 1$ for all $l = 1, \dots, k$,

      $$\min_{D, \theta} \Big(\sum_{i=1}^{p} \sum_{j \in T_i} \mu\, C_i(\{S_l(\alpha^\star_{jl}, x_j, D, \theta)\}_{l=1}^p) + (1 - \mu)\, S_i(\alpha^\star_{ji}, x_j, D, \theta)\Big) + \lambda_2 \|\theta\|_2^2. \tag{11}$$

Figure 2: SDL: Supervised dictionary learning algorithm.


[17] achieves good results in terms of convergence speed for this class of problems. For our
specific problem, denoting by $f$ the convex function to minimize, this method only requires
$\nabla f$ and a bound on the spectral norm of its Hessian $\mathcal{H}_f$. Since we have chosen
models $g_i$ in Eq. (10) which are linear in $\alpha$, there exists, for each signal $x$ to be
sparsely represented, a matrix $A$ in $\mathbb{R}^{k \times p}$ and a vector $b$ in $\mathbb{R}^p$
such that

$$\begin{cases} f(\alpha) = C_i(A^T \alpha + b) + \lambda_0 \|x - D\alpha\|_2^2, \\ \nabla f(\alpha) = A \nabla C_i(A^T \alpha + b) - 2\lambda_0 D^T (x - D\alpha), \end{cases}$$

and it can be shown that, if $\|U\|_2$ denotes the spectral norm of a matrix $U$ (which is the
magnitude of its largest eigenvalue), then
$\|\mathcal{H}_f\|_2 \le (1 - \frac{1}{p}) \|A^T A\|_2 + 2\lambda_0 \|D^T D\|_2$. In the case where
$p = 2$ (only two classes), we can obtain a tighter bound,
$\|\mathcal{H}_f(\alpha)\|_2 \le e^{-C_1(A^T \alpha + b) - C_2(A^T \alpha + b)} \|a_1 - a_2\|_2^2 + 2\lambda_0 \|D^T D\|_2$,
where $a_1$ and $a_2$ are the first and second columns of $A$.
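
The two ingredients named above, the gradient of $f$ and a bound on its Hessian, are enough to run a basic proximal gradient solver. The sketch below is a stand-in for illustration and not the FPC method of [17]: it uses a constant step $1/L$ with $L$ taken from the general bound, assumes $C_i(z) = \log\sum_j e^{z_j - z_i}$, and uses random toy data with hypothetical names.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def f(alpha, i, x, D, A, b, lam0):
    """Smooth part: C_i(A^T alpha + b) + lam0 * ||x - D alpha||_2^2."""
    z = A.T @ alpha + b
    r = x - D @ alpha
    cost = np.log(np.exp(z - z.max()).sum()) + z.max() - z[i]
    return float(cost + lam0 * (r @ r))

def grad_f(alpha, i, x, D, A, b, lam0):
    """A grad C_i(A^T alpha + b) - 2 lam0 D^T (x - D alpha)."""
    z = A.T @ alpha + b
    grad_C = softmax(z)            # dC_i/dz_l = softmax_l(z) - 1[l == i]
    grad_C[i] -= 1.0
    return A @ grad_C - 2.0 * lam0 * D.T @ (x - D @ alpha)

def hessian_bound(D, A, lam0):
    """(1 - 1/p) ||A^T A||_2 + 2 lam0 ||D^T D||_2."""
    p = A.shape[1]
    return (1.0 - 1.0 / p) * np.linalg.norm(A.T @ A, 2) \
        + 2.0 * lam0 * np.linalg.norm(D.T @ D, 2)

def supervised_sparse_code(i, x, D, A, b, lam0, lam1, n_iter=300):
    """Proximal gradient with constant step 1/L (soft-thresholding prox)."""
    L = hessian_bound(D, A, lam0)
    alpha = np.zeros(D.shape[1])
    for _ in range(n_iter):
        v = alpha - grad_f(alpha, i, x, D, A, b, lam0) / L
        alpha = np.sign(v) * np.maximum(np.abs(v) - lam1 / L, 0.0)
    return alpha

rng = np.random.default_rng(0)
n, k, p = 10, 20, 3
D = rng.standard_normal((n, k))
D /= np.linalg.norm(D, axis=0)
A, b = 0.1 * rng.standard_normal((k, p)), np.zeros(p)
x = rng.standard_normal(n)
lam0, lam1 = 1.0, 0.1
alpha_star = supervised_sparse_code(0, x, D, A, b, lam0, lam1)
```

Running the solver once per class index `i` yields the $p$ codes needed by the sparse coding step of Figure 2.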


4.2  Dictionary  update


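The problem of updating $D$ and $\theta$ in Eq. (11) is not convex in general (except when $\mu$ is
close to 0), but a local minimum can be obtained using projected gradient descent (as in the
general literature on dictionary learning, this local minimum has experimentally been found to be
good enough for our formulation). Denoting by $E(D, \theta)$ the function we want to minimize in
Eq. (11), we just need the partial derivatives of $E$ with respect to $D$ and the parameters
$\theta$. Details when using the linear model for the $\alpha$'s,
$g_i(x, \alpha, \theta) = w_i^T \alpha + b_i$, and
$\theta = \{W \in \mathbb{R}^{k \times p}, b \in \mathbb{R}^p\}$, are

$$\begin{aligned}
\frac{\partial E}{\partial D} &= -2\lambda_0 \Big(\sum_{i=1}^{p} \sum_{j \in T_i} \sum_{l=1}^{p} \omega_{jil} (x_j - D\alpha^\star_{jl}) \alpha^{\star T}_{jl}\Big), \\
\frac{\partial E}{\partial W} &= \sum_{i=1}^{p} \sum_{j \in T_i} \sum_{l=1}^{p} \omega_{jil}\, \alpha^\star_{jl} \nabla C_l^T (W^T \alpha^\star_{jl} + b), \\
\frac{\partial E}{\partial b} &= \sum_{i=1}^{p} \sum_{j \in T_i} \sum_{l=1}^{p} \omega_{jil} \nabla C_l (W^T \alpha^\star_{jl} + b),
\end{aligned} \tag{12}$$

where

$$\omega_{jil} = \mu \nabla_l C_i(\{S_m^\star(x_j, D, \theta)\}_{m=1}^p) + (1 - \mu) 1_{l=i}. \tag{13}$$

These expressions cover the data term of Eq. (11); the regularizer $\lambda_2 \|\theta\|_2^2$ adds
$2\lambda_2 W$ and $2\lambda_2 b$ to the last two derivatives. Partial derivatives when using our
model with the bilinear models $g_i(x, \alpha, \theta) = x^T W_i \alpha + b_i$ are not given in
this paper because of space limitations.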
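
As a sanity check on the derivative with respect to $D$, the sketch below evaluates the data term of Eq. (11) for fixed codes and compares the analytic gradient with a finite difference. It assumes the linear model, a softmax cost $C_i(z) = \log\sum_j e^{z_j - z_i}$ with $\nabla_l C_i(z) = \mathrm{softmax}_l(z) - 1_{l=i}$, and random codes standing in for the $\alpha^\star_{jl}$; every name and value is hypothetical.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def softmax_cost(z, i):
    return float(np.log(np.exp(z - z.max()).sum()) + z.max() - z[i])

def costs(x, D, codes, W, b, lam0, lam1):
    """Vector of S_l(alpha_l, x, D, theta), l = 1..p, for fixed codes (k x p)."""
    out = np.empty(codes.shape[1])
    for l in range(codes.shape[1]):
        a = codes[:, l]
        r = x - D @ a
        out[l] = softmax_cost(W.T @ a + b, l) + lam0 * (r @ r) + lam1 * np.abs(a).sum()
    return out

def energy(D, X, labels, codes, W, b, lam0, lam1, mu):
    """Data term of Eq. (11); the lam2 term does not depend on D."""
    total = 0.0
    for x, i, C in zip(X.T, labels, codes):
        S = costs(x, D, C, W, b, lam0, lam1)
        total += mu * softmax_cost(S, i) + (1.0 - mu) * S[i]
    return total

def grad_D(D, X, labels, codes, W, b, lam0, lam1, mu):
    G = np.zeros_like(D)
    for x, i, C in zip(X.T, labels, codes):
        S = costs(x, D, C, W, b, lam0, lam1)
        omega = mu * softmax(S)             # mu * softmax_l(S) ...
        omega[i] += (1.0 - mu) - mu         # ... - mu*1[l==i] + (1-mu)*1[l==i]
        for l in range(C.shape[1]):
            a = C[:, l]
            G += -2.0 * lam0 * omega[l] * np.outer(x - D @ a, a)
    return G

rng = np.random.default_rng(0)
n, k, p, m = 5, 8, 3, 4
X = rng.standard_normal((n, m))
labels = np.array([0, 1, 2, 1])
D = rng.standard_normal((n, k))
codes = 0.3 * rng.standard_normal((m, k, p))   # codes[j][:, l] stands in for alpha*_{jl}
W, b = 0.1 * rng.standard_normal((k, p)), np.zeros(p)
lam0, lam1, mu = 0.5, 0.1, 0.3
G = grad_D(D, X, labels, codes, W, b, lam0, lam1, mu)
```

A projected gradient step then subtracts a multiple of `G` from `D` and rescales any column whose norm exceeds one.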


5  Experimental  validation 

We compare in this section a reconstructive approach, dubbed REC, which consists of learning a
reconstructive dictionary $D$ as in [5] and then learning the parameters $\theta$ a posteriori; SDL
with generative training (dubbed SDL-G); and SDL with discriminative learning (dubbed SDL-D). We
also compare the performance of the linear (L) and bilinear (BL) models.
         REC L   SDL-G L   SDL-D L   REC BL   k-NN, $\ell_2$   SVM-Gauss
MNIST    4.33    3.56      1.05      3.41     5.0              1.4
USPS     6.83    6.67      3.54      4.38     5.2              4.2

Table 1: Error rates in percent on the MNIST and USPS datasets for the REC, SDL-G L, and SDL-D
L approaches, compared with k-nearest neighbor and SVM with a Gaussian kernel [20].


Before presenting experimental results, let us briefly discuss the choice of the five model
parameters $\lambda_0$, $\lambda_1$, $\lambda_2$, $\mu$ and $k$ (size of the dictionary). Tuning all
of them using cross-validation is cumbersome and unnecessary since some simple choices can be made,
some of which can be done sequentially. We define first the sparsity parameter
$\kappa = \lambda_1 / \lambda_0$, which dictates how sparse the decompositions are. When the input
data points have unit $\ell_2$ norm, choosing $\kappa = 0.15$ was empirically found to be a good
choice. For reconstructive tasks, a typical value often used in the literature (e.g., [19]) is
$k = 256$. Nevertheless, for discriminative tasks, increasing the number of parameters is likely to
allow overfitting, and smaller values like $k = 64$ or $k = 32$ are preferred. The scalar
$\lambda_2$ is a regularization parameter for preventing the model from overfitting the input data.
As in logistic regression or support vector machines, this parameter is crucial when the number of
training samples is small. Performing cross-validation with the fast method REC quickly provides a
reasonable value for this parameter, which can be used afterward for SDL-G or SDL-D.

Once $\kappa$, $k$ and $\lambda_2$ are chosen, let us see how to find $\lambda_0$, which plays the
important role of controlling the trade-off between reconstruction and discrimination in Eq. (3).
First, we perform cross-validation for a few iterations with $\mu = 0$ to find a good value for
SDL-G. Then, a scale factor making the $S_i^\star$'s discriminative for $\mu > 0$ can be chosen
during the optimization process: Given a set of $S_i^\star$'s, one can compute a scale factor
$\gamma^\star$ such that
$\gamma^\star = \arg\min_{\gamma} \sum_{i=1}^{p} \sum_{j \in T_i} C_i(\{\gamma S_l^\star(x_j, D, \theta)\}_{l=1}^p)$.
We therefore propose the following strategy, which has proven to be efficient during our
experiments: Starting from small values for $\lambda_0$ and a fixed $\kappa$, we apply the algorithm
in Figure 2, and after a supervised sparse coding step, we compute the best scale factor
$\gamma^\star$, and replace $\lambda_0$ and $\lambda_1$ by $\gamma^\star \lambda_0$ and
$\gamma^\star \lambda_1$. Typically, applying this procedure during the first 10 iterations has
proven to lead to reasonable values for these parameters.

Since we are following a continuation path starting from $\mu = 0$ to $\mu = 1$, the optimal value
of $\mu$ is found along the path by measuring the classification performance of the model on a
validation set during the optimization.

5.1  Digits  recognition 

In this section, we present experiments on the popular MNIST [20] and USPS handwritten digit
datasets. MNIST is composed of 70 000 images of 28 × 28 pixels, 60 000 for training, 10 000 for
testing, each of them containing a handwritten digit. USPS is composed of 7291 training images
and 2007 test images. As is often done in classification, we have chosen to learn pairwise binary
classifiers, one for each pair of digits. Although we have presented a multiclass framework,
pairwise binary classifiers have proven to offer a slightly better performance in practice.
Five-fold cross-validation has been performed to find the best pair $(k, \kappa)$. The tested
values for $k$ are {24, 32, 48, 64, 96}, and for $\kappa$, {0.13, 0.14, 0.15, 0.16, 0.17}. Then, we
have kept the three best pairs of parameters and used them to train three sets of pairwise
classifiers. For a given patch $x$, the test procedure consists of selecting the class which
receives the most votes from the pairwise classifiers. All the other parameters are obtained using
the procedure explained above. Classification results are presented in Table 1 when using the
linear model. We see that for the linear model L, SDL-D L performs the best. REC BL offers a larger
feature space and performs better than REC L. Nevertheless, we have observed no gain by using
SDL-G BL or SDL-D BL instead of REC BL. Since the linear model is already performing very well, one
side effect of using BL instead of L is to increase the number of free parameters and thus to cause
overfitting. Note that the best error rates published on these datasets (without any modification
of the training set) are 0.60% [18] for MNIST and 2.4% [21] for USPS, using methods tailored to
these tasks, whereas ours is generic and has not been tuned to the handwritten digit classification
domain.
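
The voting rule used at test time can be sketched as follows. This is an illustration only: the dictionary of pairwise outcomes is made up, ties are broken in favor of the smallest class index (an assumption of this sketch), and a single set of pairwise classifiers is shown.

```python
import numpy as np
from itertools import combinations

def vote(pairwise_winner, n_classes):
    """pairwise_winner maps each pair (a, b), a < b, to the winning class label."""
    votes = np.zeros(n_classes, dtype=int)
    for a, b in combinations(range(n_classes), 2):
        votes[pairwise_winner[(a, b)]] += 1
    return int(np.argmax(votes)), votes

# Toy decisions for 4 classes: class 2 wins every duel it takes part in.
winners = {(0, 1): 0, (0, 2): 2, (0, 3): 3, (1, 2): 2, (1, 3): 1, (2, 3): 2}
label, votes = vote(winners, 4)
```

For the ten digit classes this scheme involves 45 pairwise classifiers per set; votes from several sets can be combined by adding the `votes` arrays before taking the maximum.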

The purpose of our second experiment is not to measure the raw performance of our algorithm, but
to answer the question “are the obtained dictionaries $D$ discriminative per se or is the pair
$(D, \theta)$ discriminative?”. To do so, we have trained on the USPS dataset 10 binary
classifiers, one per digit in a one vs all fashion on the training set. For a given value of $\mu$,
we obtain 10 dictionaries $D$ and 10 sets of parameters $\theta$, learned by the SDL-D L model.
M       REC L   SDL-G L   SDL-D L   REC BL   SDL-G BL   SDL-D BL   Gain
300     48.84   47.34     44.84     26.34    26.34      26.34      0%
1500    46.8    46.3      42        22.7     22.3       22.3       2%
3000    45.17   45.1      40.6      21.99    21.22      21.22      4%
6000    45.71   43.68     39.77     19.77    18.75      18.61      6%
15000   47.54   46.15     38.99     18.2     17.26      15.48      15%
30000   47.28   45.1      38.3      18.99    16.84      14.26      25%

Table 2: Error rates for the texture classification task using various frameworks and sizes M of
training set. The last column indicates the gain between the error rate of REC BL and SDL-D BL.


Figure 3: On the left, a reconstructive and a discriminative dictionary. On the right, average error
rate in percent obtained by our dictionaries learned in a discriminative framework (SDL-D L) for
various values of $\mu$, when used at test time in a reconstructive framework (REC-L).


To evaluate the discriminative power of the dictionaries $D$, we discard the learned parameters
$\theta$ and use the dictionaries as if they had been learned in a reconstructive REC model: For
each dictionary, we decompose each image from the training set by solving the simple sparse
reconstruction problem from Eq. (1) instead of using supervised sparse coding. This provides us
with some coefficients $\alpha$, which we use as features in a linear SVM. Repeating the sparse
decomposition procedure on the test set permits us to evaluate the performance of these learned
linear SVMs. We plot the average error rate of these classifiers in Figure 3 for each value of
$\mu$. We see that using the dictionaries obtained with discriminative learning ($\mu > 0$,
SDL-D L) dramatically improves the performance of the basic linear classifier learned a posteriori
on the $\alpha$'s, showing that our learned dictionaries are discriminative per se. Figure 3 shows
a dictionary adapted to the reconstruction of the MNIST dataset and a discriminative one, adapted
to “9 vs all”.

5.2  Texture  classification 

In the digit recognition task, our BL bilinear framework did not perform better than L, and we
believe that one of the main reasons is the simplicity of the task, where a linear model is rich
enough. The purpose of our next experiment is to answer the question “When is BL worth using?”. We
have chosen to consider two texture images from the Brodatz dataset, presented in Figure 4, and to
build two classes, composed of 12 × 12 patches taken from these two textures. We have compared the
classification performance of all our methods, including BL, for a dictionary of size $k = 64$ and
$\kappa = 0.15$. The training set was composed of patches from the left half of each texture and
the test sets of patches from the right half, so that there is no overlap between the training and
test sets. Error rates are reported for varying sizes of the training set. This experiment shows
that in some cases, the linear model completely fails and BL is necessary. Discrimination is
particularly valuable for large training sets: in Table 2, the gain of SDL-D BL over REC BL grows
from 0% for M = 300 to 25% for M = 30000. Note that we did not perform any cross-validation to
optimize the parameters $k$ and $\kappa$ for this experiment. Dictionaries obtained with REC and
SDL-D BL are presented in Figure 4. Note that though they are visually quite similar, they lead to
very different performance.

6  Conclusion 

We have introduced in this paper a discriminative approach to supervised dictionary learning that
effectively exploits the corresponding sparse signal decompositions in image classification tasks,
and affords an effective method for learning a shared dictionary and multiple (linear or bilinear)
models. Future work will be devoted to adapting the proposed framework to shift-invariant models
that are standard in image processing tasks, but not readily generalized to the sparse dictionary
learning setting. We are also investigating extensions to unsupervised and semi-supervised learning
and applications to natural image classification.
Figure 4: Left: test textures. Right: reconstructive and discriminative dictionaries.